Loan Defaults Report

This document is a data science report presenting insights into the features that can potentially lead to loan defaults.

General Information

Version : 0.7

Name : Loan Defaults Prediction Project

Purpose : Predicting if a loan is going to default or not

Date : 2025-03-06

Contributors : Charalambos Pittordis

Description : This data science project tries to predict whether an individual receiving a loan will end up defaulting, based on multiple features related to loan_amount, annual_income, purpose of loan and home_ownership status.

Source Code : TBC, on github


Dataset Information

Origin : All Lending Club Loan Data

Description : 2007 through current Lending Club accepted and rejected loan data

Depth : from 2007 to current date

Perimeter : only residential sales

Target Variable : loan_default

Target Description : loan default = [True, False]


Data Preparation

Variable Filtering : All variables containing outliers, and those that required special knowledge or prior calculations to use, were removed

Missing Values : Missing values were replaced by the mean of their column during feature engineering
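As an illustration of the mean replacement described above, here is a minimal sketch in plain Python. The helper name impute_column_mean and the sample annual_income values are assumptions made for this example; they are not taken from the project's script.

```python
from statistics import fmean


def impute_column_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


# Hypothetical annual_income column with two missing entries.
annual_income = [40_000.0, None, 60_000.0, 80_000.0, None]
imputed = impute_column_mean(annual_income)
print(imputed)
```

When this is combined with a train/test split, the column mean is usually computed on the training rows only and then reused for the test rows, so that no information from the test set leaks into training.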

Feature Engineering : No new features were created. All features were selected carefully; numerical features were transformed via StandardScaler and categorical features via one-hot encoding. The Synthetic Minority Oversampling Technique (SMOTE) was also applied to oversample minorities within the one-hot-encoded categorical features, i.e. classes underrepresented for purpose, such as wedding and school.
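The two transformations named above can be sketched in plain Python to show the arithmetic involved. This is not the project's code: the function names and sample values are assumptions for illustration. The scaling uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Centre to zero mean and scale to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample values for two of the features mentioned in the report.
loan_amount = [5_000.0, 10_000.0, 15_000.0]
scaled = standard_scale(loan_amount)

purpose = ["wedding", "school", "wedding"]
categories, encoded = one_hot(purpose)
print(scaled, categories, encoded)
```

After scaling, a column has mean 0 and standard deviation 1 on the data it was fitted on, which is why the univariate statistics later in this report are expressed in standardised units.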

Path To Script : TBC, on github


Model Training

Used Algorithm : We used an XGBClassifier (XGBoost), but this model could be challenged with other models such as LogisticRegression and Keras deep learning neural networks.

Parameters Choice : We performed hyperparameter optimisation via GridSearchCV and chose n_estimators=200, max_depth=12, learning_rate=0.1, enable_categorical=True, as these parameters gave a good ROC-AUC score without overfitting.
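To make concrete what an exhaustive grid search evaluates, the sketch below enumerates the candidates for a small grid in plain Python. The grid itself is hypothetical: only the chosen values (200, 12, 0.1) come from this report, and the other values are placeholders, since the grid actually searched is not listed here.

```python
from itertools import product

# Hypothetical search grid: only the chosen values (200, 12, 0.1) come from the report.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [6, 12],
    "learning_rate": [0.05, 0.1],
}


def expand_grid(grid):
    """List every parameter combination an exhaustive grid search would try."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
chosen = {"n_estimators": 200, "max_depth": 12, "learning_rate": 0.1}
print(len(candidates), chosen in candidates)
```

Each candidate is fitted once per cross-validation fold, so the cost of the search grows with the product of the grid sizes multiplied by the number of folds.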

Metrics : Accuracy, Precision, Recall (Sensitivity), F1-Score, ROC-AUC, Confusion Matrix

Validation Strategy : We split our data into train (80%) and test (20%) sets
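The observation counts reported under Global analysis (140,036 training and 35,010 prediction rows) are consistent with an 80/20 split of 175,046 rows. A minimal sketch of such a split follows, assuming the test share is rounded up and using a hypothetical seed; the function name is also an assumption, not the project's code.

```python
import math
import random


def train_test_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    # Assumption: the test share is rounded up to a whole number of rows.
    n_test = math.ceil(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for reproducibility
    return indices[n_test:], indices[:n_test]


# 175,046 is the sum of the two observation counts reported in this document.
train_idx, test_idx = train_test_indices(175_046)
print(len(train_idx), len(test_idx))
```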

Path To Script : TBC, on github


Model analysis

Model used : XGBClassifier

Library : xgboost.sklearn

Library version : 2.1.4

Model parameters :

Parameter key              Parameter value
n_estimators               200
objective                  binary:logistic
max_depth                  12
max_leaves                 None
max_bin                    None
grow_policy                None
learning_rate              0.1
verbosity                  None
booster                    None
tree_method                None
gamma                      None
min_child_weight           None
max_delta_step             None
subsample                  None
sampling_method            None
colsample_bytree           None
colsample_bylevel          None
colsample_bynode           None
reg_alpha                  None
reg_lambda                 None
scale_pos_weight           None
base_score                 None
missing                    nan
num_parallel_tree          None
random_state               42
n_jobs                     None
monotone_constraints       None
interaction_constraints    None
importance_type            None
device                     None
validate_parameters        None
enable_categorical         True
feature_types              None
max_cat_to_onehot          None
max_cat_threshold          None
multi_strategy             None
eval_metric                None
early_stopping_rounds      None
callbacks                  None
n_classes_                 2
_Booster

Dataset analysis

Global analysis

                          Training dataset    Prediction dataset
number of features        20                  20
number of observations    140,036             35,010
missing values            0                   0
% missing values          0                   0

Univariate analysis

fico_range_high - Numeric

         Training dataset    Prediction dataset
count    140,036             35,010
mean     -0.143              -0.148
std      0.928               0.925
min      -1.79               -1.65
25%      -0.906              -0.906
50%      -0.338              -0.37
75%      0.277               0.277
max      4.3                 4.3

Target analysis

loan_default - Categorical

                   Training dataset    Prediction dataset
distinct values    2                   2
missing values     0                   0

Multivariate analysis


Model explainability

Note : the explainability graphs were generated using the test set only.

Global feature importance plot

Features contribution plots

fico_range_high -


Model performance

Univariate analysis of target variable

loan_default - Categorical

                   True values    Prediction values
distinct values    2              2
missing values     0              0

Metrics

Accuracy : 0.878

Precision : 0.909

Recall : 0.84

F1 Score : 0.873

ROC AUC : 0.878

Confusion Matrix :
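As a consistency check on the metrics listed above, the reported F1 score can be recomputed from the reported precision and recall. The sketch below also shows how accuracy, precision and recall follow from the four confusion-matrix cells; the cell counts in the example are hypothetical and are not this model's confusion matrix.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Precision and recall as reported in the Metrics section.
reported_f1 = round(f1_score(0.909, 0.84), 3)

# Hypothetical counts, for illustration only (not the report's confusion matrix).
example = metrics_from_counts(tp=84, fp=8, fn=16, tn=92)
print(reported_f1, example)
```

The recomputed value rounds to 0.873, which agrees with the F1 score reported above.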